Biostat 203B Homework 1

Due Jan 26, 2024 @ 11:59PM

Author

Jiyin (Jenny) Zhang, UID: 606331859

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.2       evaluate_0.23    

Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

  1. Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).

  2. Create a private repository biostat-203b-2024-winter and add Hua-Zhou and TA team (Tomoki-Okuno for Lec 1; jonathanhori and jasenzhang1 for Lec 80) as your collaborators with write permission.

  3. Top directories of the repository should be hw1, hw2, … Maintain two branches, main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, all code and extra data sets to reproduce results) in the main branch.

  4. After each homework due date, course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after deadline, penalty points will be deducted for late submission.

  5. After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.

Answer: My GitHub repository is at https://github.com/Zhangjiyin2000/biostat-203b-2024-winter
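The branch-and-tag workflow described in the list above can be sketched as follows. This is a minimal illustration in a throwaway local repository, not the actual history of my repository: the directory, the file contents, the commit messages, and the user identity are all hypothetical placeholders.

```shell
# Hypothetical sandbox: a throwaway repository in a temporary directory.
repo=$(mktemp -d)
git -C "$repo" init -q
# Identity set locally so commits work on a machine without a global git config.
git -C "$repo" config user.name "Example Student"
git -C "$repo" config user.email "student@example.com"

# First commit, then make sure the presentation branch is called main.
mkdir "$repo/hw1"
echo "draft" > "$repo/hw1/hw1.qmd"
git -C "$repo" add hw1
git -C "$repo" commit -q -m "Start hw1"
git -C "$repo" branch -M main

# Day-to-day work happens on develop.
git -C "$repo" checkout -q -b develop
echo "solution" >> "$repo/hw1/hw1.qmd"
git -C "$repo" commit -q -am "Answer hw1 questions"

# For submission: bring the work into main and tag it.
git -C "$repo" checkout -q main
git -C "$repo" merge -q develop   # fast-forward, main has no commits of its own
git -C "$repo" tag hw1

tags=$(git -C "$repo" tag --list)
current_branch=$(git -C "$repo" rev-parse --abbrev-ref HEAD)
echo "tags: $tags; branch: $current_branch"
```

With a real remote, the branch and the tag would still need to be pushed, for example with git push origin main followed by git push origin hw1, because tags are not pushed by default.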

Q2. Data ethics training

This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)

Answer: I completed the CITI training. Here is the link to my completion report. Here is the link to my completion certificate.

Q3. Linux Shell Commands

  1. Make the MIMIC v2.2 data available at location ~/mimic.
ls -l ~/mimic/

Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of the data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessarily big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in the following exercises.

Use Bash commands to answer the following questions.

Answer: I created a symbolic link mimic to my MIMIC data folder. Here is the output of ls -l ~/mimic/:

ls -l ~/mimic/
total 48
-rw-rw-r--@  1 zhangjiyin  staff  13332 Jan  5  2023 CHANGELOG.txt
-rw-rw-r--@  1 zhangjiyin  staff   2518 Jan  5  2023 LICENSE.txt
-rw-rw-r--@  1 zhangjiyin  staff   2884 Jan  6  2023 SHA256SUMS.txt
drwxr-xr-x@ 24 zhangjiyin  staff    768 Jan  5 23:41 hosp
drwxr-xr-x@ 11 zhangjiyin  staff    352 Jan  5 23:41 icu
lrwxr-xr-x   1 zhangjiyin  staff     61 Jan 24 22:46 mimic-iv-2.2 -> /Users/zhangjiyin/Desktop/ucla/23-24/winter/203B/mimic-iv-2.2

Here is how I created the symbolic link:

# ln -s /Users/zhangjiyin/Desktop/ucla/23-24/winter/203B/mimic-iv-2.2 ./mimic
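As a small illustration of how ln -s behaves, here is a sketch that uses throwaway directories in place of my real folders. All paths and file contents below are hypothetical stand-ins. One detail worth knowing: if the link name already exists as a directory, ln creates the link inside that directory instead, which is possibly why the listing above shows a nested mimic-iv-2.2 entry.

```shell
# Hypothetical stand-ins for the real locations: a throwaway "download" folder
# and a throwaway place in which the link is created.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/downloads/mimic-iv-2.2/hosp"
echo "subject_id" > "$sandbox/downloads/mimic-iv-2.2/hosp/patients.csv"

# ln -s TARGET LINK_NAME: create LINK_NAME pointing at TARGET.
# LINK_NAME should not exist yet; if it is an existing directory, ln places
# the link inside that directory instead.
ln -s "$sandbox/downloads/mimic-iv-2.2" "$sandbox/mimic"

# The trailing slash makes ls list the target directory, not the link itself.
ls -l "$sandbox/mimic/"

# Files are readable through the link without copying anything.
link_target=$(readlink "$sandbox/mimic")
first_line=$(head -n 1 "$sandbox/mimic/hosp/patients.csv")
echo "mimic -> $link_target"
```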
  2. Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.

  3. Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.

  4. (Looping in Bash) What’s the output of the following bash script?

for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done

Display the number of lines in each data file using a similar loop. (Hint: combine linux commands zcat < and wc -l.)

  5. Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)

  6. What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)

  7. To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
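The pipelines hinted at in these questions can be sketched on toy data. The example below is only an illustration: it builds a hypothetical three-row file rather than reading ~/mimic, and the column positions (1 for subject_id, 3 for admission_type) are assumptions of the toy file, not of the real MIMIC tables.

```shell
# Toy stand-in for ~/mimic/hosp: a hypothetical three-row admissions file.
work=$(mktemp -d)
printf '%s\n' \
  'subject_id,hadm_id,admission_type' \
  '10,100,URGENT' \
  '10,101,ELECTIVE' \
  '11,102,URGENT' | gzip > "$work/admissions.csv.gz"

# Line count per compressed file, streamed so no decompressed copy is written.
for datafile in "$work"/*.gz
do
  printf '%s: ' "$datafile"
  zcat < "$datafile" | wc -l
done

# Data rows: drop the header line before counting.
n_rows=$(zcat < "$work/admissions.csv.gz" | tail -n +2 | wc -l | tr -d ' ')

# Number of unique values in column 1 (subject_id in the toy file).
n_patients=$(zcat < "$work/admissions.csv.gz" | tail -n +2 |
  awk -F, '{ print $1 }' | sort -u | wc -l | tr -d ' ')

# Count per unique value of column 3, most frequent first.
# uniq -c only merges adjacent duplicates, hence the sort before it.
type_counts=$(zcat < "$work/admissions.csv.gz" | tail -n +2 |
  awk -F, '{ print $3 }' | sort | uniq -c | sort -nr)

echo "rows: $n_rows, unique patients: $n_patients"
echo "$type_counts"
```

Splitting on commas with awk -F, is only safe when no field contains a quoted comma, so the column handling would need checking against the real files. zcat < FILE is equivalent to gzip -dc < FILE.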

Q5. More fun with Linux

Try the following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.

Answer: Here is the output of the commands:

cal
    January 2024      
Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6  
 7  8  9 10 11 12 13  
14 15 16 17 18 19 20  
21 22 23 24 25 26 27  
28 29 30 31           
                      

cal: display the calendar of the current month.

cal 2024
                            2024
      January               February               March          
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6               1  2  3                  1  2  
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   3  4  5  6  7  8  9  
14 15 16 17 18 19 20  11 12 13 14 15 16 17  10 11 12 13 14 15 16  
21 22 23 24 25 26 27  18 19 20 21 22 23 24  17 18 19 20 21 22 23  
28 29 30 31           25 26 27 28 29        24 25 26 27 28 29 30  
                                            31                    

       April                  May                   June          
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6            1  2  3  4                     1  
 7  8  9 10 11 12 13   5  6  7  8  9 10 11   2  3  4  5  6  7  8  
14 15 16 17 18 19 20  12 13 14 15 16 17 18   9 10 11 12 13 14 15  
21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22  
28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29  
                                            30                    

        July                 August              September        
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6               1  2  3   1  2  3  4  5  6  7  
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   8  9 10 11 12 13 14  
14 15 16 17 18 19 20  11 12 13 14 15 16 17  15 16 17 18 19 20 21  
21 22 23 24 25 26 27  18 19 20 21 22 23 24  22 23 24 25 26 27 28  
28 29 30 31           25 26 27 28 29 30 31  29 30                 
                                                                  

      October               November              December        
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
       1  2  3  4  5                  1  2   1  2  3  4  5  6  7  
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   8  9 10 11 12 13 14  
13 14 15 16 17 18 19  10 11 12 13 14 15 16  15 16 17 18 19 20 21  
20 21 22 23 24 25 26  17 18 19 20 21 22 23  22 23 24 25 26 27 28  
27 28 29 30 31        24 25 26 27 28 29 30  29 30 31              
                                                                  

cal 2024: display the calendar of the year 2024.

cal 9 1752
   September 1752     
Su Mo Tu We Th Fr Sa  
       1  2 14 15 16  
17 18 19 20 21 22 23  
24 25 26 27 28 29 30  
                      
                      
                      

cal 9 1752: display the calendar for September 1752. This month is unusual because the British Empire switched from the Julian calendar to the Gregorian calendar in September 1752. By then the Julian calendar was 11 days behind the Gregorian calendar, so the 11 days from September 3 to September 13 were skipped: September 2 is followed directly by September 14.

date
Thu Jan 25 09:58:22 PST 2024

date: display the current date and time.

hostname
zhangjiyindeAir.lan

hostname: display the name of the host.

arch
arm64

arch: display the machine's architecture type (here arm64).

uname -a
Darwin zhangjiyindeAir.lan 21.6.0 Darwin Kernel Version 21.6.0: Thu Mar  9 20:10:19 PST 2023; root:xnu-8020.240.18.700.8~1/RELEASE_ARM64_T8101 arm64

uname -a: display all available system information: kernel name, host name, kernel release, kernel version, and machine hardware name.

uptime
 9:58  up 2 days,  8:53, 2 users, load averages: 1.78 1.45 1.16

uptime: display the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

who am i
zhangjiy tty??    Jan 25 09:58 

who am i: display the user name, terminal, and login time of the current session.

who
zhangjiyin console  Jan 23 01:06 
zhangjiyin ttys000  Jan 24 00:52 

who: display the users who are currently logged in.

# w 

w: display the users who are currently logged in and what they are doing.

id
uid=501(zhangjiyin) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),98(_lpadmin),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),399(com.apple.access_ssh),400(com.apple.access_remote_ae),701(com.apple.sharepoint.group.1)

id: display the user and group information for the current user.

last | head
zhangjiyin  ttys000                   Wed Jan 24 00:52   still logged in
zhangjiyin  console                   Tue Jan 23 01:06   still logged in
reboot    ~                         Tue Jan 23 01:05 
zhangjiyin  console                   Mon Jan 22 14:51 - 01:04  (10:13)
reboot    ~                         Mon Jan 22 14:42 
shutdown  ~                         Mon Jan 22 14:42 
zhangjiyin  ttys000                   Thu Jan 18 10:16 - 10:16  (00:00)
zhangjiyin  ttys000                   Thu Jan 18 10:11 - 10:11  (00:00)
zhangjiyin  ttys000                   Thu Jan 18 10:10 - 10:10  (00:00)
zhangjiyin  ttys000                   Thu Jan 18 09:10 - 09:10  (00:00)

last | head: display the 10 most recent entries of the login history, including user logins, reboots, and shutdowns.

echo {con,pre}{sent,fer}{s,ed}
consents consented confers confered presents presented prefers prefered

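echo {con,pre}{sent,fer}{s,ed}: Bash brace expansion generates every combination of one item from each brace group, in left-to-right order, giving 2 × 2 × 2 = 8 words: “consents”, “consented”, “confers”, “confered”, “presents”, “presented”, “prefers”, “prefered”. The expansion is purely textual, so misspelled results such as “confered” and “prefered” are printed as-is.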
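To check the count of generated words, here is a minimal sketch. It calls bash explicitly because brace expansion is a Bash feature rather than part of POSIX sh; the second pattern (hw1 to hw3 with report.qmd and report.html) is a hypothetical example of my own, not part of the assignment.

```shell
# Brace expansion is a bash feature, so bash is invoked explicitly here in
# case the surrounding shell is a plain POSIX sh.
words=$(bash -c 'echo {con,pre}{sent,fer}{s,ed}')
echo "$words"

# Count the generated words: 2 prefixes x 2 stems x 2 endings.
n_words=$(echo "$words" | wc -w | tr -d ' ')
echo "number of words: $n_words"

# The expansion is purely textual: no file or word has to exist.
# Hypothetical file names, 3 folders x 2 extensions.
paths=$(bash -c 'echo hw{1,2,3}/report.{qmd,html}')
echo "$paths"
```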

time sleep 5

real    0m5.007s
user    0m0.000s
sys 0m0.001s

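time sleep 5: display the time it takes to run the command sleep 5. real is the elapsed wall-clock time (about 5 seconds), while user and sys are the CPU time spent in user mode and kernel mode; both are near zero because sleep only waits.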

history | tail

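history | tail: display the last 10 commands in the shell history. No output appears here, most likely because the command ran in a non-interactive shell, where Bash does not record history by default.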

Q6. Book

  1. Git clone the repository https://github.com/christophergandrud/Rep-Res-Book for the book Reproducible Research with R and RStudio to your local machine.

  2. Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)

The point of this exercise is (1) to get the book for free and (2) to see an example how a complicated project such as a book can be organized in a reproducible way.

For grading purpose, include a screenshot of Section 4.1.5 of the book here.

Answer:

I was also able to build git_book and epub_book but not pdf_book.

Here is the screenshot of Section 4.1.5 of the git_book.

[Screenshot: Section 4.1.5 of the git_book]

Here is the screenshot of Section 4.1.5 of the epub_book.

[Screenshot: Section 4.1.5 of the epub_book]